Abstract: This article compares estimation methods specially
designed to combat the problems induced by multicollinearity,
using real-life data of different specifications and
distributions. Judged by the mean squared error on the samples
studied, partial least squares regression (PLSR) emerged as the
best estimator among the methods considered. Stepwise regression
performs better when the predictor variables are highly
correlated. For ridge regression, the smallest eigenvalue of the
predictor variables of the original data was used in determining
the ridge parameter, since the variances of some of our samples
cannot be estimated by ordinary least squares regression. In
predicting the response variable, the PLSR estimator ranks
“best”, followed by stepwise regression and then the principal
component regression (PCR) estimator. We are not surprised that
the ridge regression (RR) estimator ranks last among the methods,
since it is a biased estimator and is more useful for estimating
the parameters of the model. We also note that PLSR is efficient
in prediction when the sample size is very small.

Index Terms: Multicollinearity, Principal component
regression, Eigenvalue, Partial least squares, Ridge regression,
Nonorthogonal data, Stepwise regression.


I. INTRODUCTION 

The term multicollinearity denotes the existence of a perfect
(exact) or near-perfect linear relationship among some or all
explanatory variables of a regression model [1]. If the
explanatory variables are perfectly correlated, that is, if the
correlation coefficient for these variables is equal to unity,
the parameters become indeterminate: it is impossible to obtain
numerical values for each parameter separately and the method of
least squares breaks down. Multicollinearity may also be induced
by the choice of model; for instance, the addition of polynomial
terms to a regression model may cause ill-conditioning in $X'X$.
Furthermore, if the range of $x$ is small, adding an $x^2$ term
can result in severe multicollinearity, and if the number of
explanatory variables exceeds the sample size, the LS method may
produce misleading results.
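The effect of a polynomial term on the conditioning of $X'X$ can be illustrated numerically. The following is a minimal sketch, not taken from the study: the regressor values, the ranges, and the helper name `gram_condition_number` are all assumptions chosen for illustration.

```python
import numpy as np

def gram_condition_number(x, degree):
    """Condition number of X'X for a polynomial design matrix in x."""
    # Columns are 1, x, x^2, ..., x^degree (no centering or scaling).
    X = np.vander(x, degree + 1, increasing=True)
    return np.linalg.cond(X.T @ X)

# Hypothetical regressor observed over a narrow range.
x_narrow = np.linspace(10.0, 11.0, 20)
# The same number of points spread over a wide range.
x_wide = np.linspace(-5.0, 5.0, 20)

cond_linear = gram_condition_number(x_narrow, 1)
cond_quadratic = gram_condition_number(x_narrow, 2)
cond_wide_quadratic = gram_condition_number(x_wide, 2)

# Sample correlation between x and x^2 over the narrow range.
corr_x_x2 = np.corrcoef(x_narrow, x_narrow ** 2)[0, 1]

print(f"cond(X'X), linear, narrow range:    {cond_linear:.3e}")
print(f"cond(X'X), quadratic, narrow range: {cond_quadratic:.3e}")
print(f"cond(X'X), quadratic, wide range:   {cond_wide_quadratic:.3e}")
print(f"corr(x, x^2), narrow range:         {corr_x_x2:.6f}")
```

Over the narrow range, $x$ and $x^2$ are almost perfectly correlated, so the quadratic design is far worse conditioned than either the linear design or the quadratic design over the wide range.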

Several techniques have been proposed for dealing with the
problems caused by multicollinearity. The general approaches
include the collection of additional information, model
respecification, and the use of estimation methods specially
designed to combat the problems induced by multicollinearity.
Collecting additional information is not always possible
because of economic constraints or because the process being
studied is no longer available for sampling. Even when
additional data are available, they may be inappropriate to use
if the new data extend the range of interest. Furthermore, if the
new data points are unusual or atypical of the process being
studied, their presence in the sample could be highly
influential on the fitted model. Finally, the addition of more
data is not a valid solution to the problem of multicollinearity
when the multicollinearity is due to constraints on the model or
in the population.

Some respecification of the regression equation may lessen
the impact of multicollinearity, especially when it is caused by
the choice of model. Model respecification, done either by
redefining the regressors or by variable elimination, may not
provide a satisfactory solution if the new model does not
preserve the information contained in the original data and/or
if the regressors dropped from the model have significant
explanatory power relative to the response variable.

Given the drawbacks of the two approaches just discussed, this
paper aims to discuss and compare different estimation methods
designed to solve the problems of multicollinearity. The
methods include principal component regression, partial least
squares, ridge regression, and stepwise regression. Three
different types of multicollinear data (where the sample size is
smaller than or equal to the number of predictor variables,
where the predictor variables are highly correlated, and where
polynomial terms are added to the model) were studied with the
intention of finding the best method for each data type.

II. Estimation methods 
A. Partial Least Squares 

Partial least squares (PLS) is a method for constructing
predictive models when the factors are many and highly
collinear [2]. The emphasis of PLS is on predicting the responses
and not necessarily on understanding the underlying
relationship between the variables. For example, PLS is not
usually appropriate for screening out factors that have a
negligible effect on the response. However, when prediction
is the goal and there is no practical need to limit the number of
measured factors, PLS can be a useful tool.

PLS can be applied in monitoring industrial processes; a large
process can easily have hundreds of controlling variables and
dozens of outputs.

Multiple linear regression can be used with very many factors.
However, if the number of factors gets too large (for example,
greater than the number of observations), you are likely to get
a model that fits the sampled data perfectly but that will fail to
predict new data well. This phenomenon is called over-fitting.
In such cases, although there are many manifest factors, there
may be only a few underlying or latent factors that account for
most of the variation in the response. The general idea of PLS
is to try to extract these latent factors, accounting for as much
of the manifest factor variation as possible while modeling the
responses well.
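The over-fitting phenomenon described above is easy to reproduce on simulated data. The sketch below is illustrative only; the sample sizes, noise level, and the assumption that a single factor drives the response are ours, not the study's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 10, 200, 20   # more factors than training observations

def simulate(n):
    """Hypothetical data: only the first of p factors affects the response."""
    X = rng.normal(size=(n, p))
    y = X[:, 0] + 0.5 * rng.normal(size=n)
    return X, y

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# With p > n the least squares problem is underdetermined; lstsq returns
# the minimum-norm solution, which reproduces the training responses.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((y_train - X_train @ beta) ** 2)
test_mse = np.mean((y_test - X_test @ beta) ** 2)

print(f"training MSE: {train_mse:.3e}")
print(f"test MSE:     {test_mse:.3f}")
```

The training error is zero up to rounding, while the error on fresh data is not, which is the situation that motivates extracting a few latent factors instead.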

The aim of partial least squares is to predict the response by a
model that is based on linear transformations of the
explanatory variables. Partial least squares (PLS) is a method
of constructing regression models of the type

$$Y = \beta_0 + \beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_p T_p \tag{1}$$

where the $T_i$ are linear combinations of the explanatory
variables $X_1, X_2, \ldots, X_K$ such that the sample correlation
for any pair $T_i, T_j$ ($i \neq j$) is 0. Following the procedures
given in [3], all the data are first centered. Let
$\bar{y}, \bar{x}_1, \ldots, \bar{x}_K$ denote the sample means of
the $T \times (K + 1)$ data matrix
$(y, X) = (y, x_1, \ldots, x_K)$
and define the variables

$$U_1 = Y - \bar{y} \tag{2}$$

$$V_{1i} = X_i - \bar{x}_i \quad (i = 1, \ldots, K) \tag{3}$$

Then the data values are the $T$-vectors

$$u_1 = y - \bar{y}\mathbf{1}, \quad (\bar{u}_1 = 0) \tag{4}$$

$$v_{1i} = x_i - \bar{x}_i\mathbf{1}, \quad (\bar{v}_{1i} = 0) \tag{5}$$

The linear combinations $T_i$, called factors, latent variables, or
components, are then determined sequentially. The procedure
is as follows:

i. $U_1$ is first regressed against $V_{11}$, then regressed
against $V_{12}$, ..., then regressed against $V_{1K}$. The
univariate regression equations are

$$\hat{U}_{1i} = b_{1i} V_{1i} \quad (i = 1, \ldots, K) \tag{6}$$

where

$$b_{1i} = \frac{v_{1i}' u_1}{v_{1i}' v_{1i}}$$

Then each of the $K$ equations in (6) provides an estimate
of $U_1$. To have one resulting estimate, one may use a
simple average $\sum_{i=1}^{K} b_{1i} V_{1i} / K$ or a weighted
average such as

$$T_1 = \sum_{i=1}^{K} w_{1i} b_{1i} V_{1i}$$

with the data value

$$t_1 = \sum_{i=1}^{K} w_{1i} b_{1i} v_{1i}$$

ii. The variable $T_1$ should be a useful predictor of $U_1$ and
hence of $Y$. The information in the variable $X_i$ that is
not in $T_1$ may be estimated by the residuals from a
regression of $X_i$ on $T_1$, which are identical to the
residuals, say $V_{2i}$, if $V_{1i}$ is regressed on $T_1$, that is

$$V_{2i} = V_{1i} - \frac{t_1' v_{1i}}{t_1' t_1} T_1$$

To estimate the amount of variability in $Y$ that is not
explained by the predictor $T_1$, one may regress $U_1$ on $T_1$
and take the residuals, say $U_2$.

iii. Define now the individual predictors

$$\hat{U}_{2i} = b_{2i} V_{2i} \quad (i = 1, \ldots, K) \tag{10}$$

where $b_{2i}$ is the least squares slope from regressing $U_2$
on $V_{2i}$, see (11), and the weighted average

$$T_2 = \sum_{i=1}^{K} w_{2i} b_{2i} V_{2i}$$

iv. General iteration step

Having performed this algorithm $k$ times, the
remaining residual variability in $Y$ is $U_{k+1}$ and the
residual information in $X_i$ is $V_{(k+1)i}$,
where

$$U_{k+1} = U_k - \frac{t_k' u_k}{t_k' t_k} T_k \tag{12}$$

and

$$V_{(k+1)i} = V_{ki} - \frac{t_k' v_{ki}}{t_k' t_k} T_k \tag{13}$$

Regressing $U_{k+1}$ against $V_{(k+1)i}$ for $i = 1, \ldots, K$
gives the individual predictors

$$\hat{U}_{(k+1)i} = b_{(k+1)i} V_{(k+1)i}$$

with

$$b_{(k+1)i} = \frac{v_{(k+1)i}' u_{k+1}}{v_{(k+1)i}' v_{(k+1)i}} \tag{14}$$

and the $(k+1)$th component

$$T_{k+1} = \sum_{i=1}^{K} w_{(k+1)i} b_{(k+1)i} V_{(k+1)i} \tag{15}$$

v. Suppose that this process has stopped in the $p$th step,
resulting in the PLS regression model given in (1).
The parameters $\beta_0, \beta_1, \ldots, \beta_p$ are estimated by
univariate OLS. This can be proved as follows.

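In matrix notation we may define, with lower-case letters
denoting the corresponding data values,

$$V_{(k)} = (V_{k1}, \ldots, V_{kK}) \quad (k = 1, \ldots, p) \tag{16}$$

$$\hat{U}_{(k)} = (b_{k1} V_{k1}, \ldots, b_{kK} V_{kK}) \quad (k = 1, \ldots, p) \tag{17}$$

$$w_{(k)} = (w_{k1}, \ldots, w_{kK})' \quad (k = 1, \ldots, p) \tag{18}$$

$$T_k = \hat{U}_{(k)} w_{(k)} \quad (k = 1, \ldots, p) \tag{19}$$

$$v_{(k)} = v_{(k-1)} - \frac{t_{k-1} t_{k-1}' v_{(k-1)}}{t_{k-1}' t_{k-1}} \tag{20}$$

By construction the sample residuals $v_{(k+1)i}$ are
orthogonal to $v_{ki}, v_{(k-1)i}, \ldots, v_{1i}$, implying that
$v_{(k)}' v_{(j)} = 0$ for $k \neq j$, hence
$\hat{u}_{(k)}' \hat{u}_{(j)} = 0$ for $k \neq j$, and finally

$$t_k' t_j = 0 \quad (k \neq j)$$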


The well-known feature of PLS is that the sample
components $t_i$ are pairwise uncorrelated. The simple
consequence is that the parameters $\beta_k$ may be estimated by
simple univariate regression of $Y$ against $T_k$. Furthermore,
the preceding estimates $\hat{\beta}_k$ stay unchanged if a new
component is added.


The coefficient $b_{2i}$ of the individual predictors in (10) is

$$b_{2i} = \frac{v_{2i}' u_2}{v_{2i}' v_{2i}} \tag{11}$$

B. Principal Component Regression

Principal component regression is a regression procedure
used in the presence of multicollinearity among the
$k$-dimensional random vector of predictor variables $X$, that
is, where the matrix $X$ is not of full rank (rank$(X) < k$) or
when the number of predictor variables exceeds the
sample size.

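The model of principal component regression is of the form

$$y = XPP'\beta + \varepsilon \tag{21}$$

This can as well be written as

$$y = \tilde{X}\tilde{\beta} + \varepsilon \tag{22}$$

where $\tilde{X} = XP$ and $\tilde{\beta} = P'\beta$.
Let the columns of the orthogonal matrix $P = (p_1, \ldots, p_k)$
of the eigenvectors of $X'X$ be numbered according to the
magnitude of the eigenvalues
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$. Then
$\tilde{x}_i = Xp_i$ is the $i$th principal component and we get

$$\tilde{x}_i'\tilde{x}_i = p_i'X'Xp_i = \lambda_i \tag{23}$$

We now assume exact multicollinearity. Hence
rank$(X) = k - j$ with $j \geq 1$. We get

$$\lambda_{k-j+1} = \cdots = \lambda_k = 0 \tag{24}$$

According to the subdivision of the eigenvalues into the
groups $\lambda_1 \geq \cdots \geq \lambda_{k-j} > 0$ and
$\lambda_{k-j+1} = \cdots = \lambda_k = 0$, we
define the subdivision

$$P = (P_1, P_2), \quad
\Lambda = \begin{pmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{pmatrix}, \quad
\tilde{X} = (\tilde{X}_1, \tilde{X}_2) = (XP_1, XP_2), \quad
\tilde{\beta} = \begin{pmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \end{pmatrix}
= \begin{pmatrix} P_1'\beta \\ P_2'\beta \end{pmatrix}$$

with $\tilde{X}_2 = 0$ as in (24). We now obtain

$$y = \tilde{X}_1\tilde{\beta}_1 + \tilde{X}_2\tilde{\beta}_2 + \varepsilon
= \tilde{X}_1\tilde{\beta}_1 + \varepsilon$$

The OLS estimate of the $(k - j)$-vector $\tilde{\beta}_1$ is
$b_1 = (\tilde{X}_1'\tilde{X}_1)^{-1}\tilde{X}_1'y$. The OLS
estimate of the full vector $\tilde{\beta}$ is $(b_1', 0')'$,
which in the original coordinates gives

$$P\begin{pmatrix} b_1 \\ 0 \end{pmatrix} = (X'X)^{-}X'y
= (P\Lambda^{-}P')X'y \tag{25}$$

with

$$\Lambda^{-} = \begin{pmatrix} \Lambda_1^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$

being a generalized inverse of $\Lambda$.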
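The estimate in (25) can be checked numerically on a small design matrix with an exact linear dependency. The sketch below is illustrative; the function name `pcr_exact_collinearity`, the relative tolerance used to declare an eigenvalue zero, and the simulated data are assumptions.

```python
import numpy as np

def pcr_exact_collinearity(X, y, tol=1e-10):
    """PCR estimate of beta via the eigendecomposition of X'X."""
    eigvals, P = np.linalg.eigh(X.T @ X)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # renumber so lambda_1 >= ... >= lambda_k
    eigvals, P = eigvals[order], P[:, order]
    keep = eigvals > tol * eigvals[0]         # eigenvalues treated as positive
    P1, lam1 = P[:, keep], eigvals[keep]
    X1 = X @ P1                               # retained principal components
    b1 = (X1.T @ y) / lam1                    # (X1'X1)^{-1} X1'y, as X1'X1 = diag(lam1)
    beta = P1 @ b1                            # back to the original coordinates
    return beta, b1, int(keep.sum())

# Hypothetical data: the fourth column is an exact sum of the first two.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 4))
X[:, 3] = X[:, 0] + X[:, 1]
y = rng.normal(size=12)

beta_pcr, b1, n_kept = pcr_exact_collinearity(X, y)

# Minimum-norm least squares solution, for comparison.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=1e-10)
```

With the generalized inverse that sets the zero block of $\Lambda$ to zero, the PCR estimate coincides with the minimum-norm least squares solution returned by `np.linalg.lstsq`.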
C. Ridge Regression 

When the method of least squares is applied to
nonorthogonal data, very poor estimates of the regression
coefficients are usually obtained: the variance of the least
squares (LS) estimates of the regression coefficients may be
considerably inflated, and the length of the vector of least
squares parameter estimates is too long on average [4].
This implies that the absolute values of the least squares
estimates are too large and that they are very unstable, that
is, their magnitudes and signs may change considerably given a
different sample.

The problem with the method of LS is the requirement that the
estimator of $\beta$ should be unbiased. The Gauss-Markov
property of the regression parameters assures us that the LS
estimator $\hat{\beta}$ of $\beta$ has minimum variance in the
class of unbiased linear estimators, without guarantee that the
variance will be small. If the variance of $\hat{\beta}$ is
large, it implies that confidence intervals on $\beta$ would be
wide and the point estimate $\hat{\beta}$ is very unstable.


One way to alleviate this problem is to drop the requirement
that the estimator of $\beta$ be unbiased. [5] and [6] proposed a
biased estimator (ridge estimator) of $\beta$,

$$\hat{\beta}_R = (X'X + kI)^{-1}X'y \tag{26}$$

that has a smaller variance than the unbiased estimator
$\hat{\beta}$. This ridge estimator is a linear transformation of
the LS estimator since

$$\hat{\beta}_R = (X'X + kI)^{-1}X'y
= (X'X + kI)^{-1}(X'X)\hat{\beta}
= Z_k\hat{\beta} \tag{27}$$

where $\hat{\beta}$ and $\hat{\sigma}^2$ are found from the least
squares solution.

[7] stated that both the mean squared error and the smallest
eigenvalue of the predictor variables of the original data play
a vital role in determining the biasing parameter ($k$) of ridge
regression. [9] showed through simulation that the resulting
ridge estimator had a significant improvement in mean squared
error (MSE) over LS.
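A short numerical sketch of the ridge estimator (26) and the identity (27) follows. The data are simulated, and setting $k$ equal to the smallest eigenvalue of $X'X$ is only one simple reading of an eigenvalue-based choice; it is an assumption for illustration and not necessarily the rule used in this study.

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y for a biasing parameter k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Hypothetical nonorthogonal data: the third column nearly repeats the first.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=30)
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=30)

XtX = X.T @ X
# Assumed choice of k: the smallest eigenvalue of X'X (eigvalsh sorts ascending).
k = np.linalg.eigvalsh(XtX)[0]

beta_ols = np.linalg.solve(XtX, X.T @ y)
beta_ridge = ridge_estimator(X, y, k)

# Z_k = (X'X + kI)^{-1} X'X maps the LS estimate to the ridge estimate.
Z_k = np.linalg.solve(XtX + k * np.eye(3), XtX)

print("k =", k)
print("LS estimate:   ", beta_ols, " length:", np.linalg.norm(beta_ols))
print("ridge estimate:", beta_ridge, " length:", np.linalg.norm(beta_ridge))
```

For any $k > 0$ the ridge vector is shorter than the LS vector, which is the shrinkage that trades a small bias for a reduction in variance.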

The mean squared error of the estimator $\hat{\beta}_R$ is
defined as

$$\mathrm{MSE}(\hat{\beta}_R) = E(\hat{\beta}_R - \beta)^2
= V(\hat{\beta}_R) + [E(\hat{\beta}_R) - \beta]^2$$

$$\mathrm{MSE}(\hat{\beta}_R) = \mathrm{var}(\hat{\beta}_R)
+ (\text{bias in } \hat{\beta}_R)^2 \tag{28}$$

Note that the MSE is just the expected squared distance from
$\hat{\beta}_R$ to $\beta$. By allowing a small amount of bias in
$\hat{\beta}_R$, the variance of $\hat{\beta}_R$ can be made small
such that the MSE of $\hat{\beta}_R$ is less than the variance of
the unbiased estimator $\hat{\beta}$. Consequently, confidence
intervals on $\beta$ would be much narrower using the biased
estimator. The small variance of the biased estimator also
implies that $\hat{\beta}_R$ is a more stable estimator of
$\beta$ than the unbiased estimator $\hat{\beta}$.
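The decomposition in (28) follows from adding and subtracting the expectation of the estimator. A short derivation for a single coefficient, writing $\hat{\beta}_R$ for the ridge estimator of (26), is sketched here:

```latex
\begin{aligned}
E(\hat{\beta}_R - \beta)^2
  &= E\bigl[(\hat{\beta}_R - E\hat{\beta}_R) + (E\hat{\beta}_R - \beta)\bigr]^2 \\
  &= E(\hat{\beta}_R - E\hat{\beta}_R)^2
     + 2\,(E\hat{\beta}_R - \beta)\,E(\hat{\beta}_R - E\hat{\beta}_R)
     + (E\hat{\beta}_R - \beta)^2 \\
  &= V(\hat{\beta}_R) + \bigl[E(\hat{\beta}_R) - \beta\bigr]^2 ,
\end{aligned}
```

The cross term vanishes because $E(\hat{\beta}_R - E\hat{\beta}_R) = 0$. For an unbiased estimator the second term is zero, so its MSE equals its variance.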


D. Stepwise Regression 

In deciding on the “best” set of explanatory variables for a
regression model, researchers often follow the method of
stepwise regression. In this method the ordinary least squares
(OLS) regression is performed either by introducing the X
variables one at a time (stepwise forward regression) or by
including all the possible X variables in one multiple
regression and rejecting them one at a time (stepwise
backward regression). The decision to add or drop a variable
is usually made on the basis of the contribution of the variable
to the error sum of squares (ESS), as judged by the F test.
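Stepwise forward regression can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names, the F-to-enter threshold of 4.0, and the simulated data are all assumptions, and a full implementation would compare the statistic with the appropriate F critical value.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_to_enter=4.0):
    """Add, one at a time, the variable with the largest partial F statistic."""
    n, p = X.shape
    selected = []
    design = np.ones((n, 1))              # start from the intercept-only model
    current_rss = rss(design, y)
    while len(selected) < p:
        best = None
        for j in range(p):
            if j in selected:
                continue
            candidate = np.column_stack([design, X[:, j]])
            df_resid = n - candidate.shape[1]
            if df_resid <= 0:
                continue                  # no degrees of freedom left for the test
            new_rss = rss(candidate, y)
            # Partial F: reduction in error sum of squares per residual mean square.
            if new_rss <= 0:
                f_stat = np.inf
            else:
                f_stat = (current_rss - new_rss) / (new_rss / df_resid)
            if best is None or f_stat > best[1]:
                best = (j, f_stat, new_rss)
        if best is None or best[1] < f_to_enter:
            break                         # no remaining variable passes the threshold
        selected.append(best[0])
        design = np.column_stack([design, X[:, best[0]]])
        current_rss = best[2]
    return selected

# Hypothetical data: only columns 1 and 3 affect the response.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=50)

selected = forward_stepwise(X, y)
print("order of entry:", selected)
```

Stepwise backward regression works the other way round, starting from the full model and dropping the variable with the smallest partial F statistic at each step.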


III. The data sets and their results 

To compare the performance of the methods considered, seven
different real data sets were studied to investigate their
effectiveness at predicting the response variable, using the
mean squared error (MSE). An attempt was also made to see how a
biased regression estimator (ridge regression) competes with the
other estimators studied. The data sets studied include: a data
set that contains predictor variables that are highly correlated
(this data set is from the Nigeria Stock Exchange and is based
on their transactions for the period 1991-2007; the data are
available at [8]); a data set from a chemometric study where the
number of predictors is far greater than the sample size; and a
data set with polynomials of different degrees of the predictor
variable. A case of time series data was also considered, where
time is included as a predictor variable. This data set is
obtained from [1]. Three different data sets with very small
sample sizes and varying numbers of predictor variables were
studied in an effort to find the estimator best suited to
small-sample problems. One of these data sets is on C emission
and its possible correlates for four countries, while the other
two are sampled from data found in [1]. The mean squared error
of each of the estimators described in Section II was computed
for all seven data sets. Partial least squares regression
offered an almost imperceptible improvement over the other
estimators. The detailed results for the estimators are not
reported here, to save space and to focus on the main objective
of this study. The MSE estimates for all the methods we
considered are provided in Table I. For each of the seven data
sets, the OLS regression estimator along with the other
estimators discussed in Section II were ranked according to
their MSE values, with the lowest ranks corresponding to the
lowest MSE. The median of the ranks for each method is given in
Table II.
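The ranking procedure can be illustrated on three columns of Table I (time as predictor, n=15 with p=25, and polynomials of X). The sketch below is ours: the helper `rank_row` is a hypothetical name, ties are not treated specially, and missing or non-finite entries are left unranked as an assumption. Because only three of the seven data sets are used, the medians differ from those in Table II.

```python
import numpy as np

def rank_row(values):
    """Rank finite entries from 1 (lowest MSE); other entries stay unranked (nan)."""
    values = np.asarray(values, dtype=float)
    ranks = np.full(values.shape, np.nan)
    ok = np.isfinite(values)
    order = np.argsort(values[ok])
    ranks_ok = np.empty(order.size)
    ranks_ok[order] = np.arange(1, order.size + 1)
    ranks[ok] = ranks_ok
    return ranks

estimators = ["OLSR", "Stepwise", "PLSR", "PC", "RR"]
# MSE values quoted from Table I; nan marks a case where OLSR gave no result.
mse = {
    "time as predictor": [0.475, 0.57, 0.474, 1.059, 8.774],
    "n=15, p=25":        [np.nan, 0.004, 0.002, 0.015, 0.062],
    "polynomials of X":  [0.054, 0.53, 0.033, 2.72, 0.075],
}

ranks = np.vstack([rank_row(v) for v in mse.values()])
median_rank = np.nanmedian(ranks, axis=0)   # median over the data sets, ignoring nan

for name, med in zip(estimators, median_rank):
    print(f"{name:9s} median rank over these three data sets: {med}")
```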

In this study we sought the best regression estimator for
predicting the response variable. Four estimators useful for
treating multicollinearity were compared together with OLS on
each data set. Sample correlations between pairs of predictor
variables of data set 2 (column 3 of Table I) range from 0.421
to 1.000. For the RR estimator, the last eigenvalue of $X'X$ was
used in determining the biasing parameter, since we could not
run OLSR on some of our data sets because of their small sample
sizes. For the PCR estimator, efforts were made to select the
principal components that account for 90% and above of the total
variation in the original data. In most cases not all the
components are considered in the PLSR estimator; only those
components that provide the desired results were considered.

IV. Discussion 

The OLSR and ridge regression estimators performed quite
poorly in all the data sets we studied. This comes as no
surprise because the data sets were chosen to study
the behavior of the estimators when OLSR estimation is
expected to be deficient. Also, since a measure of error is
used for comparison, it is expected that the ridge regression
estimator, being a biased estimator, will not be effective in
predicting Y. In four of the seven data sets studied the OLSR
estimator produces no result. We are not surprised that OLSR
ranks 2 on the first data set (column 2 of Table I), narrowly
behind PLSR, because it has been said in the literature that if
the sole purpose of regression analysis is prediction or
forecasting, then multicollinearity is not a serious problem
because the higher the $R^2$, the better the prediction.

We observed that method D, the stepwise regression estimator
(which is often used in deciding the “best” set of predictor
variables for a regression model), produced fairly good results,
with a median rank of 2, as can be seen in Table II. This method
is second best among the methods we studied.

The method of PLSR is the “best” among all the methods
we studied, with a median rank of 1. This method came first in
five of the data sets we studied and took second and third
position in the remaining two data sets.

The PCR estimator, method B, is next to stepwise regression in
predicting the response variable, with a median rank of 3, as
can be seen in Table II.

The RR estimator is the weakest of the estimators we studied in
predicting the Y variable. This method took the last position
among the methods we studied, as can be seen in Table II. It
is good to note that one may get different results when other
methods (e.g. variance of the distribution) are used in
determining the biasing parameter of ridge regression. From
Table I, it is clear that this method can produce misleading
results when the sample size is very small and less than or
equal to the number of predictor variables. See the tables
below.


Table I: MSE of Different Estimators across Different Types of
Multicollinear Data

Estimator  Time as     Highly      n=15,   Polynomials  n=p=6     n=4, p=6      n=5, p=6
           predictor   correlated  p=25    of X
OLSR       0.475       103085      -       0.054        -         -             -
Stepwise   0.57        102727      0.004   0.53         0.004     15620         0.33
PLSR       0.474       103088      0.002   0.033        0         0             0.001
PC         1.059       104150      0.015   2.72         0.431     22415         0.579
RR         8.774       294810      0.062   0.075        Infinity  -1.665x10^n   -82650


Table II: Estimators and their Ranks

Estimator  Ranks                 Median rank
OLSR       2, 2, 2               -
Stepwise   1, 2, 2, 2, 3, 4      2
PLSR       1, 1, 1, 1, 1, 2, 3   1
PC         3, 3, 3, 4, 5         3
RR         3, 4, 5, 5            4.5